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Abstract 

The  phase-rotation  FFT  is  a  new  form  of  the  FFT  that  replaces  data  movement  with  multiplications  by 
constant  phasor  multipliers.  The  result  is  an  FFT  that  is  simple  to  pipeline.  This  paper  reports  some 
fundamental  new  improvements  to  the  original  phase-rotation  FFT  design,  provides  a  complete  description 
of  the  algorithm  directly  in  terms  of  the  parallel  pipeline,  and  describes  a  radix-2  implementation  on  the 
iWarp  computer  system  that  balances  compulation  and  communication  to  run  at  the  full-bandwidth  of  the 
communications  links,  regardless  of  the  input  data  set  size. 
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1.  Introduction 


The  Fast  Fourier  Transform  (FFT)  is  an  important  algorithm  with  many  applications  in  signal  processing 
and  scientific  computing.  The  Whelchel  phase-rotation  FFT  [9]  derives  from  the  Pease  constant-geometrv 
FFT  [7],  which  itself  derives  from  the  original  Cooley-Tukey  FFT  [4]  expressed  in  terms  of  Kronecker 
products. 

The  phase-rotation  FFT  of  radix  r  is  designed  for  a  pipeline  of  r  parallel  data  channels.  At  each  time 
step,  in  each  stage,  the  pipeline  carries  the  next  r  data  points,  one  from  each  channel,  into  a  Discrete  Fourier 
Transform  (DFT)  kernel.  Unlike  earlier  pipelined  KPT's  [5,  6|,  the  phase-rotation  FFT  has  the  key  property 
that  no  data  is  switched  across  channels,  except  within  the  DFT  kernel  and  at  the  input  and  output.  The 
phase-rotation  approach  extends  easily  to  higher  radices,  reducing  memory  and  latency  while  preserving 
the  high  throughput  and  parallel  shuffling  simplicity  of  lower  radix  versions.  The  phase-rotation  FFT  has 
also  been  extended  to  a  vector-radix,  multidimensional  parallel-pipeline  FFT  with  the  same  qualities  of  the 
one-dimensional  algorithm,  and  without  transposes  [101. 

This  paper  describes  the  results  of  a  project  to  implement  the  phase-rotation  FFT  on  a  parallel  computer 
system.  There  are  three  main  results:  First,  the  digit-reversing  shuffle  step  in  the  original  version  of  the 
phase-rotation  FFT  [9]  is  a  potential  pipeline  bottleneck  because  it  requires  communication  between  the  data 
channels.  We  describe  a  new  version  that  corrects  this  problem  by  using  a  parallel-pipeline  digit-reversing 
step. 

Second,  although  the  structure  of  the  phase-rotation  FFT  is  extremely  simple,  we  have  learned  from 
experience  that  generating  the  appropriate  twiddles  and  shuffle  indices  from  the  original  matrix  formula¬ 
tion  [9]  is  quite  difficult,  even  for  the  designers  of  the  algorithm!  To  try  to  help  the  implementer,  we  have 
reformulated  the  phase-rotation  FFT.  We  present  a  new  set  of  recipes  for  generating  the  tw  iddles  and  shuffle 
indices  directly  in  terms  of  the  parallel  pipeline. 

Finally,  we  describe  mapping  strategies  for  the  phase-rotation  FFT  on  the  iWarp,  a  parallel  computer 
system  developed  by  Intel  and  Carnegie  Mellon  [  1 , 2 1.  We  describe  a  line-grained  approach  for  an  A  -point 
radix-2  phase-rotation  FFT  that  balances  computation  and  communication  to  run  at  the  full  40  Mbytes/sec 
rate  of  the  iWarp  physical  links,  regardless  of  the  size  of  the  input  data  sets. 

Section  2  introduces  the  phase-rotation  concept.  Section  3  formally  defines  the  improved  FFT  algorithm. 
Section  4  gives  the  recipes  for  generating  the  twiddles  and  shuffle  indices  in  terms  of  the  parallel  pipeline. 
Finally,  Section  5  describes  the  full-bandwidth  implementation  on  iWarp. 


2.  The  basic  idea 


This  section  introduces  the  concept  of  the  phase-rotatton  FFT.  Starting  w  ith  the  Pease  constant-geometry 
FFT,  we  informally  derive  the  pipelined  phase-rotation  FFT,  identifying  the  key  insights  along  the  way. 
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2.1.  Constant-geometry  FFT 


Figure  1(a)  shows  the  flowgraph  fora  radix-r  .Y -point  decimation-in-frequency  (DIF)  constant-geometry 
FFT,  with  r  =  2  and  .V  =  rn  =  8.  There  are  n  stages.  Each  stage  contains  X  j  r  kernels.  Each  kernel  is  an 
operator  that  performs  an  r-point  DFT.  For  radix  2,  each  kernel  inputs  two  complex  numbers  and  outputs 
two  complex  numbers.  (For  simplicity,  twiddles  are  not  explicitly  shown  in  the  flowgraph.) 

Each  stage  in  the  constant-geometry  FFT  performs  an  identical  perfect  stride-by-.s  shuffle  of  its  data 
vector,  where  .?  -  X / r.  If  the  data  vector  is  regarded  as  an  ,s  x  r  array,  stored  in  column-major  order, 
then  the  perfect  shuffle  simply  transposes  it  into  an  r  x  .s  array.  For  example,  the  following  transpose  is  a 
stride-by-4  perfect  shuffle,  for  .V  =  8  points  and  radix  r  =  2: 

‘  0  4 

1  5 

2  6 

.  3  7 

The  data  items  in  example  above,  labeled  by  their  indices  in  the  original  column  vector,  are  regarded  as 
equivalent  to  a  4  x  2  array  composed  by  a  stride-by-4  unstacking  of  the  8-point  column  vector.  After  the 
transpose,  the  2  x  4  array  is  equivalent  to  a  new  8-point  column  vector  composed  by  a  stride-bv-2  stacking. 
As  we  shall  see,  this  transpose  creates  difficulties  when  we  try  to  pipeline  the  constant-geometry'  FFT.  And 
it  is  precisely  these  difficulties  that  the  phase-rotation  FFT  addresses. 
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2.2.  Pipelining  the  FFT 

Each  stage  of  the  constant-geometry  FFT  can  be  computed  on  a  single  processor  by  pipelining  the  data. 
For  example.  Figure  Kb)  shows  the  pipeline  for  a  single  stage  with  radix  r  =  2.  The  pipeline  consists 
of  a  sequence  of  operators  connected  by  pipeline  segments.  Each  pipeline  segment  consists  of  r  parallel 
channels.  Each  channel  carries  a  stream  of  X/r  data  points,  which  are  labeled  in  this  example  by  their 
indices  from  the  original  column  vectors  in  Figure  1(a).  For  each  pipeline  segment,  the  r  data  points  in  the 
same  position  in  each  stream  are  known  as  an  r-frame,  or  simply,  a  frame.  For  example,  in  Figure  1(b),  the 
first  frame  in  the  pipeline  segment  between  S  and  F  is  (0,4),  the  second  frame  is  ( 1,5),  and  so  on. 

At  each  time  step,  the  r  twiddle  operators  (T)  collectively  read  a  frame  (one  complex  number  per 
operator),  perform  an  element-wise  complex  multiplication,  and  write  the  resulting  frame.  Notice  that  each 
stream  is  operated  on  independently.  Similarly,  the  kernel  operator  (F)  reads  a  frame,  computes  the  radix-/- 
kernel,  and  writes  the  resulting  frame.  In  this  case,  the  streams  are  not  independent;  each  data  item  in  the 
output  frame  is  a  function  of  every  data  point  in  the  input  frame. 

The  twiddle  and  kernel  operators  pipeline  nicely  because  during  each  time  step  they  independently  read 
and  write  a  single  number.  However,  the  pipelined  shuffle  operator  (S)  is  less  well  behaved.  To  produce 
one  output  fr  <mc,  the  shuffle  operator  must  read  and  store  the  r  data  points  from  each  stream.  Thus,  S 
requires  r  memory  cycles  to  produce  each  frame.  (Notice  that  S  transposes  the  data  directlv  into  an  r  >.  > 
pipeline  segment,  even  starting  with  data  already  in  an  r  a  pipeline,  S  still  performs  •  row-to-coiumn" 
motions.)  This  is  an  example  of  the  memory-hank  conflict  discussed  in  |8,  pp.31-32).  The  conflict  is  clear 
in  Figure  I  (b).  To  assemble  its  first  output  frame,  S  must  read  both  0  and  4  from  the  upper  stream  to  its  left. 
Then  it  must  read  I  and  5  from  the  lower  stream,  and  so  on. 
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Figure  I:  Derivation  of  the  phase-rotation  FFT.  (a)  Initial  con¬ 
stant-geometry  FFT.  (b)  Pipelined  constant  geometry  FFT.  (c)  Pipelined  FIT 
based  on  cyclic  rotations,  (d)  Pipelined  phase-rotation  FIT. 
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Figure  2:  Replacing  the  perfect  shuffle  with  three  simpler  shuffles. 

We  would  like  to  replace  the  troublesome  perfect  shuffle  operation  with  a  parallel-pipeline  shuffle,  where 
each  stream  is  read  and  written  independently  and  in  parallel.  The  next  section  describes  the  insights  that 
make  this  possible. 


2.3.  The  phase-rotation  concept 

This  section  describes  how  to  replace  the  perfect  shuffle  by  a  parallel-pipeline  shuffle,  so  that  we  can  access 
the  data  streams  in  parallel.  The  basic  idea  is  to  rotate  the  data  within  frames,  and  then  compensate  for  these 
motions  by  phase  rotations  of  the  twiddle  factors. 

We  begin  with  a  “detour”  around  the  perfect  shuffle.  That  is,  we  find  a  sequence  of  three  simpler  shuffles 
that  is  equivalent  to  the  perfect  shuffle.  This  idea  is  shown  graphically  in  Figure  2  for  an  A  -point  radix-2 
example.  Each  radix-2  pipeline  segment  is  represented  as  matrix.  Each  row  in  the  matrix  corresponds  to  a 
stream,  and  each  column  corresponds  to  a  frame.  Frames  (columns)  are  arranged  left-to-right  in  reverse-time 
order  in  the  matrix. 

The  first  step  in  Figure  2  is  a  set  of  cyclic  rotations,  called  ,  which  rotates  each  frame.  These 
rotations  are  frame-wise  in  the  sense  that  only  data  points  contained  in  the  same  frame  are  rotated  across 
the  streams.  Notice  that  in  the  radix-2  case,  half  of  the  rotations  leave  the  corresponding  frame  unchanged. 
The  next  step  is  a  parallel-pipeline  shuffle  S,  which  permutes  the  data  in  each  stream.  Notice  that  no  data 
points  need  to  be  transferred  between  streams  in  this  step.  The  last  step  is  anothei  set  of  frame-wise  cyclic 
rotations,  called  Cjast<  which  leave  the  data  in  the  same  order  that  the  perfect  shuffle  would.  Note  that 
Cshw  and  C /ast  change  the  number  of  rotations  per  frame  at  different  paces,  one  slow  and  one  fast.  These 
varying  rates  are  difficult  to  see  in  the  radix-2  case,  but  are  much  more  apparent  in  the  higher-radix  cases. 

If  we  apply  the  idea  in  Figure  2)  to  each  stage  of  the  pipelined  FFT  in  Figure  I  (b).  replacing  each  perfect 
shuffle  with  three  simpler  shuffles,  we  get  a  pipelined  FFT  based  on  cyclic  rotations,  which  is  shown  in 
Figure  1(c). 

The  kind  of  basic  frame-wise  rotations  in  Figure  1(c)  that  is  applied  at  slow-varying,  and  then  fast- 
varying  rates,  is  represented  in  general  by  the  r/r  cyclic  (circular)  shift  permutation  matrix  C,,  made  by 
permuting  the  rows  of  the  identity  matrix  down  by  one  row,  and  moving  the  bottom  row  up  to  the  top.  For 
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Figure  3:  Interpretation  of  FrC,.  =  D,F, 
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The  key  insight  of  the  phase-rotation  FFT  is  that  the  cyclic  shift  theorem  for  the  DFT  can  be  applied  to  the 
cyclic  shift  operators  in  Figure  1(c).  In  matrix  form,  the  cyclic  shift  theorem  for  a  DFT  is  the  relation 

FrCr  =  D,Fr.  (1) 

where  Dr  =  diag{  . )  is  a  set  of  twiddles,  and  the  DFT  matrix  of  size  ;•  is 


F  r 


) 


r— I 

j.k= O' 


2rrt 

where  =  e~  T .  For  the  pipelined  FFT,  (1)  says  that  phasor  multipliers  after  a  DFT  kernel  give  the 
same  effect  as  physical  data  rotations  before  the  DFT  kernel.  Likewise,  physical  rotations  after  the  kernel 
are  equivalent  to  phasor  multipliers  before  it.  The  meaning  of  (1)  is  shown  graphically  in  Figure  3  for  a 
pipelined  radix-2  kernel.  The  shift  theorem  implies  that  the  data  rotations  in  Figure  1(c)  can  be  replaced 
by  constant  phasor  multipliers.  These  phasors  can  then  be  absorbed  by  the  twiddle  factors  on  either  side  of 
the  kernel,  leaving  only  a  parallel-pipeline  shuffle.  The  result  is  the  pipelined  phase-rotation  FFT.  which  is 
shown  in  Figure  1(d).  This  completes  the  informal  derivation  of  the  phase-rotation  FFT. 


The  structure  of  the  phase-rotation  FFT  in  Figure  1(d)  is  quite  similar  to  the  original  pipelined  FFT  in 
Figure  Kb).  The  twiddle  operators  ( D ')  are  identical  to  the  original  twiddle  operators  (7  ),  except  now 
the  twiddles  incorporate  the  original  twiddles,  phasor  multipliers  for  the  C  j.ust  operator  from  the  previous 
stage,  and  phasor  multipliers  for  the  C operator  from  the  next  stage.  The  kernel  operators  are  identical 
as  well.  The  important  difference  is  that  the  troublesome  perfect  shuffle  operator  has  now  been  replaced  by 
a  parallel-pipeline  shuffle  that  requires  no  communication  across  the  streams. 


There  are  several  other  important  properties  of  the  phase-rotation  FFT.  First,  there  are  no  additional 
multiplications  or  additions,  compared  to  the  original  pipelined  FFT  in  Figure  1(a).  Second,  the  only 
internal  communication  across  streams  occurs  at  the  kernel,  and  this  communication  is  constrained,  in  that 
only  data  points  within  a  single  frame  need  to  be  switched  across  channels,  and  the  switching  is  fixed  for 
all  frames. 


3.  Improved  phase-rotation  FFT 


In  this  section  we  give  a  formal  definition  of  an  improved  version  of  the  original  phase-rotation  FFT 
described  in  (9).  The  new  version  replaces  the  digit-reversing  permutation  in  the  original  phase-rotation 
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FFT  with  a  parallel-pipeline  shuffle,  followed  by  a  frame-wise  cyclic  rotation.  The  advantage  of  this  new 
approach  is  that  during  the  digit-reversing  step  at  the  end,  all  communication  between  streams  is  limited  to 
data  points  within  a  single  frame. 


For  radix  r  and  .V  =  rnpoints(a  >  I),  the  1 -dimensional  phase-rotation  FFT  is  a  matrix  factorization  of 
the  .V-point  DFT  matrix  F\.  Starting  with  the  Pease  constant-geometry  factorization,  we  replace  its  perfect 
shuffles  S  by  S  =  C f,lstSCsi0W.  Similarly,  at  the  left  end  we  replace  the  radix-/'  index-digit-reversing 
permutation  Q  =  Q,v,r  of  .V  data  points  by  Q  =  Cjlou,QC3i0,r,  where  Q  is  another  parallel-pipeline 
shuffle  that  will  be  defined  formally  in  Section  4.  The  phase-rotation  FFT  is  then  defined  by: 


f.v  =  q  n^) 

j= i 


-  ^slow  '  QD/ast 


vigorous 

algebraic 

shuffling 


n  (fsd;) 

j= i 


Cjloir 


(2) 


Let  s  =  \/r  as  before,  and  r'  =  \'/r2.  F  is  a  direct  (tensor,  Kronecker)  product  I,  Fr  = 

diag(Fr.Fr . Fr).  We  interpret  this  as  a  kernel  DFT  Fr  operating  on  .s  successive  frames  of  r  points 

placed  in  the  pipeline.  For  j  =  1  :  n,  the  other  parts  of  (2)  are  defined  by 
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D'„  =  D"_T', 


D' 


(3) 


The  direct  sums  are  of  the  form 

r—  l 

0  Afc  =  diag(  A0.  A i . AP-i ). 

k-0 

and  Ar  denotes  the  transpose  of  A.  See  [10]  for  more  on  the  basic  definitions  and  relations  used  to  derive 
(2),  as  well  as  the  generalization  to  higher  dimension  FFT’s. 


Note  that  the  stages  in  (2)  are  counted  in  reverse  time  order  by  the  index  j.  This  is  in  keeping  with  the 
fact  that  (2)  is  a  decimation-in-frequency  (DIF)  version  of  the  FFT.  The  transpose  of  (2).  with  the  product 
is  the  decimation-in-time  (DIT)  version  of  the  phase-rotation  FFT. 

A  C siow  shuffle  and  its  inverse  remain  at  the  input  and  output  ends  of  the  pipeline,  respectively.  As 
we  have  seen,  Csiow  is  a  completely  frame-wise  rotation.  It  rotates  (commutates)  the  data  within  each 
successive  frame  (column  r-vector)  of  the  r  x  s  pipeline  segment  for  a  stage.  There  is  also  an  implicit 
frame-wise  broadcast  within  each  FFT  kernel  engine,  when  an  r-point  DFT  is  somehow  computed.  So  in 
the  phase-rotation  FFT,  data  motion  is  all  parallel,  except  for  frame-wise  motions  at  I/O  and  at  every  FFT 
kernel.  The  simplicity  of  the  phase-rotation  FFT  is  that  no  data  point  ever  moves  both  down  and  across  the 
pipeline  in  one  time-step. 


4.  Pipeline  recipes 

While  the  structure  of  the  pipelined  phase-rotation  FFT  is  extremely  simple,  experience  has  taught  us 
that  generating  the  appropriate  twiddles  and  shuffle  indices  from  the  matrix  formulations  of  (2)  and  (3)  is 
difficult  and  confusing.  To  address  this  problem,  we  have  developed  a  collection  of  recipes  for  generating 
the  phase-rotation  twiddles  and  shuffle  indices  off-line.  The  recipes  are  defined  for  any  ID  phase-rotation 
FFT  of  ,V  =  rn  points.  Following  [8],  they  are  written  in  a  M.ATLAB-like  format. 

As  we  saw  in  (2),  the  pipelined  phase-rotation  FFT  performs  a  typical  “twiddle,  shuffle,  kernel"  cycle 
at  each  stage.  Only  the  twiddles  vary  from  stage  to  stage,  and  there  is  a  digit-reversing  shuffle  equivalent 
at  the  end.  To  implement  this  FFT  using  parallel  r  x  *  pipeline  segments  (one  per  stage),  we  insert  the 
.Y-vector  of  input  data  x  into  the  pipeline  as  an  r  /  .s  array  .Y :  the  first  r  points  of  x  go  into  the  first  frame 
(column)  A',  the  second  r  points  go  into  the  second  frame,  and  so  on.  We  must  also  have  a  shuffle  address 
and  a  twiddle  factor  ready  for  each  point  in  the  pipeline.  In  other  words,  we  would  like  to  till  one  r  ■  - 
copy  A  of  the  pipeline  segment  with  addresses,  and  another  copy  D  with  twiddles. 

Then  the  processors  in  each  stage  of  the  pipeline  will  know  what  to  do  at  each  time-step  /  =  0:s  -  I . 
Using  the  current  frame  of  addresses,  they  will  fetch  the  current  r-frame  of  data  ,Y  ( 0: r  —  1 .  .1(0:  /■  —  1 
and  the  current  r-frame  of  twiddles  D(0:r  -  l..-l(0:r  -  l.t))  (pointwise  in  parallel),  multiply  these  two 
frames  pointwise,  then  do  an  r-point  DFT  Fr  of  the  twiddled  data  frame.  That  is  how  each  stage  FSD'  is 
implemented  in  the  parallel  pipeline. 

The  twiddle  and  shuffle  recipes  in  this  section  are  "in  place”  in  the  sense  that  they  work  in.ide  the 
r  x  $  pipeline  segments  that  will  contain  the  desired  addresses  and  twiddles.  They  are  not  “in  place"  in  the 
usual  sense,  as  we  will  freely  use  an  input  -nd  an  output  copy  of  a  pipeline  segment.  This  approach  avoids 
constructing  and  operating  with  large  .V  x  ,V  matrices  (each  containing  only  .V  non-zero  elements).  Bach 
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parallel-pipeline  function  recipe  is  given  a  name  similar  to  that  of  the  .V  x  .V  matrix  factor  in  the  FFT  (2) 
that  it  effectively  implements. 


4.1.  Shuffle  recipes 

As  a  convention,  pipeline  addresses  (pipeline  array  row  and  column  indices)  run  0:r  -  I  and  ()>  -  I, 
respectively.  To  do  parallel-pipeline  shuffles,  we  only  need  the  horizontal  (column)  addresses,  since  the 
data  inside  each  pipe  will  only  jump  within  that  stream  (row).  The  cross-stream  shuffles,  Cslow  andCfast.  are 
implemented  using  rrr,  a  cyclic  rotation  of  a  frame  (a  vertical  slice  of  the  parallel  pipeline)  that  has  the  effect 
of  yr  =  Crxr.  ~r  takes  a  column  r-vector  xr  =  I  i-q.  jj.  .o . rr_(  )r  ■—  y„  -  ( ,rr_  | .  .r().  .>•  ( . r . 

function  Y  =  Cslow(.V) 
col  =  0 
for  k  =  1  :  r 
for  j  =  1  :  r1 

Y(:.col)  =  ~*(X(:.col)) 
col  -  col  +  1 

end 

end 

function  Y  =  Cfast(  X ) 
col  =  0 
for  j  =  1  :  r 
for  k  =  1  :  r' 

Y(:.col)  -  irf(X(:.col)) 
col  —  col  4-  1 
end 
end 


The  inverses  of  Cslow  and  Cfast  are  formed  by  simply  reversing  ~r.  Next,  we  detine  some  perfect  shuffles. 

function  Y  =  S(  X )  Istride  by  -s 
col  =  0 

for  roir  =  0  :  r  —  1 
for  k  1  =  0  :  r  :  .s  -  r 
k2  =  kl  +  r-  \ 

Y{row.k\  :  k2)  =  X(:.col) 
col  =  col  +  I 

end 

end 

function  Y  =  S- 1  ( .V )  Istride  by  r 
col  =  0 

for  row  —()r—  1 
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for  A:  1  =  0  :  r  :  .s  -  r 
k2  -  k\  +  r  -  1 
Y(\,col)  =  X(rmr.k\  :  k2) 
col  —  col  +  1 

end 

end 


To  implement  the  parallel-pipeline  shuffles,  S,  S  ,  and  Q,  we  will  use  ;he  parallel-pipeline  addresses  1, 
which  are  computed  by  the  following  function: 

function  .1  =  S -addresses!  r.  * ) 

a  =  (0.  r' . ( r  -  1  )r')T 

col  —  0 
for  j  =  1  :  r' 
for  k  =  ]  :  r 

li  ..rol)  =  a 
col  =  rol  4-  I 
a  =  Tr(a) 

end 

a  —  a  -f-  1  r 
end 

Looking  closely,  one  can  see  Cfast" 1  at  work  producing  the  addresses  1  in  the  last  function.  The  addresses 
A  can  also  be  generated  by  loading  a  pipeline  segment  with  simple  r  :<  r  address  blocks  /L-.  and  then 
applying  Cfast-1  to  the  pipeline  segment.  The  first  block  to  load  is  Brr  =  diagf ():/■'>  -  1 )  +  lrr,  where  1 ... 
is  the  r  :<  r  matrix  whose  entries  are  all  l’s.  The  next  block  is  always  Brr  =  Brr  -f  1,  -,  until  the  pipeline 
segment  contains  r'  blocks  and  is  full. 

function  V'  =  S(.V) 

A  =  S_addresses(  r.  s ) 
for  roir  =  0  :  r  -  1 

Vi  roic. : )  =  A’f  row.  ,  l(  row. : ) ) 
end 


function  >  =  S  (  V) 

•l  =  S .addresses!  r.  * ) 

[.1.1. 1]  ~  sort(.l) 
for  row  =  0  :  r  —  1 

Ylrmr,:)  —  .V (row.  /(  row. :)) 

end 


In  the  above  functions,  sort(.-l)  sorts  each  row  of  an  array  .1  in  ascending  order.  It  returns  the  row-sorted 
array  .-1/1  and  the  corresponding  array  of  addresses  /  where  the  successive  row  elements  were  found  in  . I. 

After  we  have  sorted  the  addresses  A  for  S,  /  has  the  addresses  for  S 
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The  pipeline  addresses  for  Q  are  obtained  by  block-perfect  shuffles  (along  the  length  of  the  pipeline)  of 
the  addresses  for  S: 

function  V'  =  Q(  .V,  n) 

A  =  S_addresses(  r.  .s) 

if  »  >  2 

for  k.s  =  ( ft  —  2)  :  —  1  :  1 
stride  =  rn.s 

block  =  j'"---'1-’  i  block  length 
coll  =  0 

for  k\  =  1  :  stride 

col  1  =  (k i  -  1 )  *  block 
for  k  =  1  :  r 

for  j  —  1  :  block 

B[:,col2)  =  A(:,col  1) 
co/1  =  co/1  +  1 
coll  =  co/2  +  1 
end 

co/I  =  co/I  +  ( stride  -  1 )  *  block 
end 
end 
A  =  B 
end 
end 

for  row  =  0  :  r  -  1 

Y{ row , : )  =  A'(  row,  A(  row, : ) ) 
end 


4.2.  Twiddle  recipes 

Every  twiddle  matrix  D  is  diagonal,  so  it  operates  on  a  data  vector  as  a  point-to-point  vector  multiply. 
Given  some  permutation  matrix  P,  a  new  twiddle  matrix  PDPr  is  equivalent  to  a  rediagonalizing  of  the 
vector  shuffle  of  thediagonal  of  D,  that  is,  PDPr  =diag(P*diag(D)).  (This  is  a  Matlab  notation:  diag() 
pm  the  diagonal  of  a  matrix  in  a  vector,  and  puts  a  vector  in  the  diagonal  of  a  matrix.)  Since  we  want 
to  erform  shuffles  within  pipeline  arrays,  we  reshape  the  twiddle  .V-vcctor  diag(D)  as  an  r  x  .*  pipeline 
at.  y  D,  just  as  we  originally  reshaped  the  data  vector.  Then  we  shuffle  the  pipelined  twiddles,  to  effect  the 
eqm valent  of  the  vector  shuffle-P*diag(D).  So  we  interpret  the-PD.P  r  operator  as  an  in-pipeline  shuffle 
of  the  pipelined  twiddles  D,  which  are  then  in  position  to  operate  on  the  pipelined  data  .V  directly  by  point- 
to-f  >int  multiplication,  Y  =  D.  *  X.  (As  mentioned,  the  data  will  actually  be  twiddled  frame-by-frame  in 
the  pipelined  implementation.) 

We  will  interpret  the  twiddles  expressed  in  (3)  this  way.  Each  twiddle  function  below  returnsLUn./^  * 
array  D  of  twiddle  factors  (the  actual  twiddling  of  the  data  is  not  included): 
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function  =  Dslow_twiddlcs(r,.s) 

=  cxp(-2xi/r) 
t  =  0 

for  j  =  0:  (r  -  1) 

for  t  =  0  :  ( /  -  I ) 

DatoJ:,l)  =  (l.u£w“ . t4r-l)fc)r 

t  =  t  +  1 
end 
end 


function  Djast  =  Dfast_twidd!es(  r,  .5) 

Wj  =  exp(—2-rci/r) 

t  =  0 

fort  =  0  :(/•'-  1) 
for  j  =  0  :  (r  -  1) 

Dfa,t(:,t)  =  (  . w<r-1)j)T 

t  =  t  +  1 
end 
end 


The  inverses  of  D,iow  and  Djasi  are  just  their  complex  conjugates,  and  are  generated  simply  by  replacing 
Wj  by  uj 1 .  For  stages  j  =  1  :n  (counted  down  from  n ),  we  generate  pipelined  twiddles  fj  by 

function  fj  =  T.twiddles( r,  .s,  j) 

{jjj  =  ex/;(2Ti/rJ  +  l ) 

J  u,rJ+l 

for  t  =  0  :  ( r  -  I )  !  direct  sum  loop 
t|  =  k  •  rJ_l 
for  p  =  0  :  ( r  -  1 ) 

fj(/>,<1)=  J* 

end 

<1  =  <1  +  1 
<2  =  l\  +  W"1 

for  t  =  <1  :  <2  1  fill  next  column  from  last 

end 

end 

if  j  <  n 
<2  =  rJ 

fort  =  0  :  (iV/r-'+l) 

/)=<2 
12  =  t  •  H 
I.  =  0 

for  to  -  <1  '•  h 


II 


=  I'ji'.J)  !  copy  columns 

:  -  !  F  1 
end 
end 
end 


The  re 't  of  the  twiddle  arrays  can  now  be  defined  in  terms  of  the  shuffles: 


=  s-'(/^L.) 

^1,, 

=  Cslowf  D'x!ow  ) 

=  Cslow<  Df It ) 

Tj 

=  S-'(tj) 

n 

=  Cslow(  Tj ) 

7,'., 

if  1  <  j 

<  n 

D'j 

=  D»low.*Vy>D-sit 

end 

o'L 


5.  Implementation  issues 

In  this  section  we  describe  issues  that  arise  when  the  phase-rotation  FFT  is  implemented  on  a  real  parallel 
system.  In  particular,  we  describe  implementation  approaches  for  the  radix-2  FFT  on  the  iWarp  system. 
The  main  result  is  a  scalable  implementation  of  the  pipelined  phase-rotation  FFT  that  runs  at  the  full  40 
Mbytes/second  rate  of  the  iWarp  physical  links. 


5.1.  iWarp 

The  iWarp  is  a  private-memory  multicomputer  developed  jointly  by  Intel  and  Carnegie  Mellon  [1,2],  iWarp 
systems  are  2-ditnensional  tori  ofiWarp  nodes,  ranging  in  size  from  4  to  1024  nodes.  Each  node  consists  of 
an  vVarp  component ,  up  to  1 6  Mbytes  of  off-chip  local  memory,  and  a  set  of  8  unidirectional  communication 
lin.  that  physically  connect  the  node  to  four  neighboring  noues. 

The  iWarp  component  is  a  VLSI  chip  that  contains  a  processing  agent  and  a  communication  agent.  The 
processing  agent  is  a  general-purpose  load-store  microprocessor,  centered  around  a  128  x  32-bjLrcgisler 
file,  that  runs  at  a  maximum  rate  of  20  MFLOPs.  The  local  memory  is  accessed  at  a  rate  of  1 60  Mbytes/sec. 
Each  link  runs  at  40  Mbytes/sec,  for  a  maximum  aggregate  bandwidth  of  320  Mbytes/sec  per  node. 

The  key  feature  of  the  iWarp  is  its  communication  system,  which  is  summarized  in  Figure  4.  Each 
communication  agent  contains  a  set  of  20  hardware  FIFO  queues.  Each  queue  can  hold  up  to  8  32-bit 
wor  's.  iWarp  nodes  communicate  with  other  nodes  using  unidirectional  point-to-point  structures  called 


node  0 


node  1 


Figure  4:  iVVarp  communication  concepts. 


pathways.  Each  pathway  is  a  sequence  of  queues.  Pathways  can  be  created  and  destroyed  dynamically  at 
runtime.  Figure  4  shows  a  pair  of  such  pathways. 

Data  traveling  along  a  pathway  passes  from  queue  to  queue  automatically ,  without  disturbing  the 
computations  on  intermediate  nodes.  For  example,  in  Figure  4,  data  items  traveling  over  the  pair  of 
pathways  do  not  disturb  the  computation  on  node  1.  The  latency  from  queue  to  queue  is  small,  ranging 
from  100-300  nanoseconds. 

Multiple  pathways  can  share  the  same  link.  For  example,  in  Figure  4,  two  pathways  share  the  link  from 
node  I  to  node  2.  In  this  case,  the  pathways  share  the  link  bandwidth  in  a  round-robin  fashion,  one  word  at 
a  time.  If  only  one  pathway  is  sending  data  over  a  link,  then  it  gets  the  entire  link  bandwidth.  If  multiple 
pathways  arc  sending  data  over  a  link,  then  the  link  can  be  utilized  at  the  full  40  Mbytes/sec,  and  each 
pathway  is  guaranteed  a  proportional  fraction  of  the  bandwidth. 

User  programs  can  directly  access  the  queues,  one  word  at  a  time,  by  reading  and  writing  special  registers 
in  the  register  file  called  gates.  To  an  iWarp  instruction,  a  gate  is  just  another  register  in  the  register  lile.  The 
important  point  is  that  a  program  can  read  or  write  a  word  in  a  queue  with  the  latency  of  a  register  access. 
A  single  instruction  can  read  and  write  up  to  4  words  from  queues,  with  a  maximum  aggregate  bandwidth 
of  160  Mbytes/sec.  Gates  can  be  accessed  directly  from  user-level  C  programs. 


5.2.  Mapping  strategics  on  iWarp 

The  problem  is  to  develop  a  mapping  of  the  flowgraph  in  Figure  1(d)  to  an  iWarp  array.  The  simplest 
mapping  strategy  is  to  assign  each  flowgraph  node  to  a  unique  processor  node  of  a  linear  arrTiy,~roiiti7 
the  flowgraph  arcs  through  this  array,  and  then  embed  the  resulting  linear  array  in  the  iWarp  toms.  This 
approach,  called  the  PHASF.5  mapping  because  it  uses  5  iWarp  nodes  for  each  FFT  stage,  is  shown  in 
Figure  5(a). 

Each  iWarp  node  in  PHASE5  executes  a  small  node  program  that  implements  its  llowgraph  operator. 
Each  twiddle  node  ( D')  repeatedly  reads  a  complex  number  from  its  input  pathway  (via  the  gates),  multiplies 
it  by  'he  appropriate  twiddle  (precomputed  off-line  using  the  recipes  in  Section  4.2),  and  sends  the  result  to 
its  output  pathway  (again,  via  the  gates).  Each  shuffle  operator  (.V)  repeatedly  reads  a  complex  data  item 
Pom  it*  -nput  pathway,  stores  it  in  memory,  and  uses  the  appropriate  shuffle  index  (again  precomputed 
off-line  using  the  recipes  in  Section  4.1)  to  send  an  appropriate  double-buffered  data  point  to  the  output 
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Figure  5:  Strategies  for  mapping  one  stage  of  the  FFT  onto  a  linear  array. 

(a)  PHASE5  mapping,  (b)  PHASE3  mapping. 

pathway.  The  kernel  node  (F)  repeatedly  reads  two  complex  numbers  from  its  input  pathways,  performs 
the  radix-2  DFT  kernel  operation,  and  outputs  two  complex  numbers  to  its  output  pathways. 

Another  approach,  the  PHASE3  mapping,  combines  the  twiddle  and  shuffle  operators  on  a  single  node, 
as  shown  in  Figure  5(b),  so  that  each  stage  requires  3  nodes  instead  of  5  nodes.  As  we  shall  see,  the 
communication  and  computation  throughputs  of  the  two  mappings  are  identical.  The  advantage  of  the 
PHASE3  mapping  is  that  it  is  more  node-efficient,  requiring  fewer  nodes  per  stage  than  the  PHASE5 
mapping.  The  advantage  of  the  PHASE5  mapping  is  its  simplicity.  Each  node  is  assigned  exactly  one 
operator  from  the  flowgraph. 

Figure  G  shows  a  working  implementation  of  a  I6K-point  radix-2  phase-rotation  FFT  on  a  64-node  iWarp 
array  at  Carnegie  Mellon.  The  large  squares  arc  iWarp  nodes,  labeled  with  the  corresponding  operator  and 
stage  number.  The  small  squares  are  queues.  The  arrows  are  iWarp  pathways.  The  implementation  is  based 
on  the  PHASE3  mapping  from  Figure  5(b).  Each  of  the  14  FFT 'stages  uses  3  nodes,  with  an  additional  3 
nodes  for  the  parallel-pipeline  digit-reversing  step  at  the  end. 


5.3.  Performance 

While  the  details  are  beyond  the  scope  of  this  paper,  each  iteration  of  each  node  program  in  the  PHASE3 
and  PHASE5  mappings  runs  in  at  most  8  clocks.  At  the  peak  rate  of  40  Mbytes/sec,  each  link  can  produce 
and  consume  a  32-bit  floating-point  number  every  2  clocks.  Further,  each  data  point  in  the  pipeline  is  a 
complex  number  consisting  of  a  pair  of  32-bit  lloating-poinl  words.  As  a  result,  each  pathway  requires 
exactly  half  of  the  available  link  bandwidth.  Since  each  link  is  shared  by  two  pathways,  and  since  the 
iWarp  communication  agent  gives  each  pathway  an  equal  share  of  the  link  bandwidth,  without  disturbing 
the  computations  on  intermediate  nodes,  each  link  is  fully  utilized.  The  result  is  a  radix-2  FFT  that  runs 
at  the  full  40  Mbytes/scc  rate  of  an  iWarp  link,  regardless  of  the  number  of  points  in  the  FFT!  Since  each 
sample  consists  of  8  bytes,  the  FFT  runs  at  a  constant  rate  of  5  Msamples/sec.  Given  a  sufficient  number  of 
nodes,  the  iWarp  phase-rotation  FFT’s  will  produce  arbitrarily  large  FFT's  at  this  rate.  Perhaps  even  more 
important,  the  performance  is  the  same  on  smaller  FFT’s. 

Another  way  to  characterize  the  performance  of  the  PHASE3  and  PHASE5  mappings  is  by  its  com¬ 
putational  throughput,  expressed  as  millions  of  lloating-point  operations  per  second  (MFLOPS).  However, 
there  is  a  subtlety  involved  in  using  MFLOPS  as  a  performance  measure.  The  iWarp  phase-rotation  ITT 
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Figure  6:  16K-point  pipelined  phase-rotation  FFT  running  at  40 
Mbytes/sec  (350  MFLOPS)  on  iWarp 


performs  performs  16  floating-point  operations  per  iteration  per  stage  (2  adds  and  4  multiplies  by  each 
twiddle  operator,  and  4  adds  by  the  kernel  operator).  But  the  standard  formula  for  computing  FFT  MFLOPS 
is  5.Y  log  N  floating-point  operations  per  N-point  FFT  [3],  which  implies  10  floating-point  operations  per 
iteration  per  stage.  Therefore,  in  order  to  do  fair  comparisons  with  other  FFT  algorithms,  we  must  compute 
the  phase  rotation  performance  using  the  standard  of  10  floating-point  operations  per  iteration  per  stage, 
even  though  the  phase-rotation  FFT  is  actually  performing  16  floating-point  operations  per  iteration  per 
stage. 


Since  each  node  program  executes  its  computation  in  at  most  8  clocks,  and  since  each  clock  is  50 
nanoseconds,  each  stage  of  the  iWarp  phase-rotation  FFT  runs  at  a  rate  of 


10  fp  operations 
8  clocks 


1  clock 


50  nanoseconds 


1  x  !09  nanoseconds 
I  second 


='  25  MFLOPS 


for  a  total  performance  over  all  of  the  log.Y  stages  of  25  log  .Y  MFLOPS.  For  example,  the  16K-point 
FFT  in  Figure  6  achieves  a  measured  performance  of  25  x  14  =  350  MFLOPS  (single  precision)  on 
iWarp.  As  a  point  of  comparison,  a  highly  optimized  16K-point  FFT  has  been  measured  at  237  MFLOPS 
(double  precision)  on  a  single-processor  Cray  Y-MP  [3.  p.  I  14).  The  numbers  are  not  directly  comparable 
because  of  the  different  floating-point  precisions,  but  they  do  suggest  that  the  absolute  performance  of  the 
phase-rotation  FFT  on  iWarp  is  quite  good. 


6.  Concluding  remarks 

We  have  described  an  improved  version  of  the  Whelche!  pipelined  phase-rotation  FFT.  developed  recipes 
for  generating  the  appropriate  twiddles  and  shuffle  indices  off-line  and  directly  in  terms  of  (he  parallel 
pipeline,  outlined  mapping  approaches  for  the  radix-2  case  on  pthe  iWarp  parallel  computer,  and  presented 
measured  performance  results  of  an  implementation  on  iWarp. 

The  improvement  on  the  original  phase-rotation  FFT  is  significant  in  that  it  eliminates  a  potential  pipeline 
bottleneck  during  the  digit  reversing  step  at  the  end.  The  twiddle  and  shuffle  recipes  should  be  helpful 
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to  the  programmer  who  wants  to  implement  the  pipelined  phase  rotation  FFT.  The  iWarp  implementation 
validates  a  simple  and  realistic  approach  for  building  scalable  pipelined  FFT’s  on  a  programmable  parallel 
system.  Further,  the  implementation  demonstrates  that,  given  a  balanced  parallel  computer  architecture 
with  word-level  access  to  the  communication  links,  it  is  possible  to  build  FFT's  that  run  at  the  full  link 
bandwidth  of  the  links,  even  when  the  FFT's  are  relatively  small. 
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